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Abstract 

Population structure inference with genetic data has been motivated by a variety of applications 
in population genetics and genetic association studies. Several approaches have been proposed for 
the identification of genetic ancestry differences in samples where study participants are assumed 
to be unrelated, including principal components analysis (PC A), multi-dimensional scaling (MDS), 
and model-based methods for proportional ancestry estimation. Many genetic studies, however, 
include individuals with some degree of relatedness, and existing methods for inferring genetic 
ancestry fail in related samples. We present a method, PC-AiR, for robust population structure 
inference in the presence of known or cryptic relatedness. PC-AiR utilizes genome-screen data and 
an efficient algorithm to identify a diverse subset of unrelated individuals that is representative of 
all ancestries in the sample. The PC-AiR method directly performs PCA on the identified ancestry 
representative subset and then predicts components of variation for all remaining individuals based 
on genetic similarities. In simulation studies and in applications to real data from Phase III 
of the HapMap Project, we demonstrate that PC-AiR provides a substantial improvement over 
existing approaches for population structure inference in related samples. We also demonstrate 
significant efficiency gains, where a single axis of variation from PC-AiR provides better prediction 
of ancestry in a variety of structure settings than using ten (or more) components of variation 
from widely used PCA and MDS approaches. Finally, we illustrate that PC-AiR can provide 
improved population stratification correction over existing methods in genetic association studies 
with population structure and relatedness. 
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Introduction 

Ancestry inference with genetic data is an essential component for a variety of applications in 
genetic association studies, population genetics, and both personalized and medical genomics. Ad- 
vances in high-throughput genotyping technology have allowed for an improved understanding of 
continental-level and fine-scale genetic structure of human populations, as well as other organ- 
isms. Principal components analysis (PCA) (Price et al., 2006; Patterson et al., 2006) has been 
the prevailing approach in recent years for both population structure inference and correction 
of population stratification in genome-wide association studies (GWAS) with high-density single 
nucleotide polymorphism (SNP) genotyping data. Other widely used methods for inference on 
genetic ancestry include multi-dimensional scaling (MDS) (Purcell et al., 2007), a dimension reduc- 
tion method similar to PCA, and model-based methods, such as STRUCTURE (Pritchard et al., 
2000), FRAPPE (Tang et al., 2005), and ADMIXTURE (Alexander et al., 2009), for proportional 
ancestry estimation in samples from admixed populations. 

Genetic studies often include related individuals; however, most existing population structure 
inference methods fail in the presence of relatedness. For example, the top principal components 
from PCA, as well as the top dimensions from MDS, can reflect family relatedness rather than 
population structure when applied to samples that include relatives (Price et al., 2010). Model- 
based ancestry estimation methods similarly fail in the presence of relatedness as they are not 
able to distinguish between ancestral groups and clusters of relatives (Thornton and Bermejo, 
2014). For certain family-based study designs with known pedigrees, the population structure 
inference method proposed by Zhu et al. (2008), where SNP loadings calculated from a PCA on 
pedigree founders are used to obtain principal components values for genotyped offspring, can be 
used. However, this approach, which we refer to as "FamPCA," fails in the presence of cryptic 
or misspecified relatedness and is not applicable to most GWAS where genealogical information 
on sample individuals is often incomplete or unavailable. The FamPCA method requires genotype 
data to be available for pedigree founders, which can be prohibitive for many genetic studies. In 
addition, inference on population structure is limited to the ancestries in the subset of genotyped 
founders, which may lack sufficient diversity to be representative of the ancestries in the entire 
sample (Chen et al., 2013). 
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We address the problem of population structure inference and correction in samples with related 
individuals. We do not put constraints on how the individuals might be related, and we allow for 
the possibility that genealogical information on sample individuals could be partially or completely 
missing. We propose a method, which we call PC-AiR (principal components analysis in related 
samples), for inference on population structure from SNP genotype data in general samples with 
related individuals. The PC-AiR method implements a fast and efficient algorithm for the identifi- 
cation of a diverse subset of mutually unrelated individuals who are representative of the ancestries 
in the entire sample. Axes of variation are inferred using this ancestry representative subset, and 
coordinates along the axes are predicted for all remaining sample individuals based on genetic sim- 
ilarities with individuals in the ancestry representative subset. The top axes of variation (principal 
components) from PC-AiR are constructed to be both representative of ancestry and robust to 
both known or cryptic relatedness in the sample. A remarkable feature of PC-AiR is the method's 
ability to identify a diverse and representative subset of individuals for ancestry inference using only 
genome-screen data from the sample, without requiring additional samples from external reference 
population panels or genealogical information on the study individuals. 

We assess the robustness and accuracy of PC-AiR for inference on genetic ancestry in simulation 
studies with both related and unrelated individuals under various types of population structure 
settings, including admixture. We also directly compare PC-AiR to existing population structure 
inference methods using both simulated data and real genotype data collected from the Mexican 
Americans in Los Angeles, California (MXL) and African American individuals in the southwestern 
USA (ASW) population samples of release 3 of phase III of the International Haplotype Map 
Project (HapMap) (International HapMap 3 Consortium, 2010). The population structure inference 
methods to which we compare PC-AiR are: (1) PCA with the EIGENSOFT (Price et al, 2006) 
software, (2) MDS with the PLINK (Purcell et al., 2007) software, (3) the model-based ancestry 
estimation methods FRAPPE (Tang et al., 2005) and ADMIXTURE (Alexander et al., 2009), and 
(4) FamPCA (Zhu et al., 2008) as implemented in the KING (Manichaikul et al., 2010) software. 
We also perform simulation studies to assess population structure correction with PC-AiR in GWAS 
with relatedness and ancestry admixture. We evaluate the type-I error when using PC-AiR principal 
components as well as widely used population stratification correction methods including: (1) the 
EIGENSTRAT (Price et al., 2006) method, which uses PCA with EIGENSOFT to correct for 
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population structure, and (2) the linear mixed model methods EMMAX (Kang et al., 2010) and 
GEMMA (Zhou and Stephens, 2012), which use variance components and an empirical genetic 
relatedness matrix to simultaneously account for both population structure and relatedness among 
sample individuals. 

Materials & Methods 
Overview of the PC-AiR Method 

Let the set Af be a sample of outbred individuals who have been genotyped in a genome-screen. 
An essential component of the PC-AiR method for population structure inference in the presence 
of relatedness is to use genome-screen data to partition M into two non-overlapping subsets, IA and 
1Z, i.e. N = IA u 1Z with IA n 1Z = 0, where IA is a subset of mutually unrelated individuals who are 
representative of the ancestries of all individuals in J\f, and 1Z is a "related subset" of individuals who 
have at least one relative in IA. We allow for individuals in 7Z to be related to each other in addition 
to having relatives in U. PC-AiR uses measures of pairwise relatedness and ancestry divergence 
calculated from autosomal SNP genotype data for the identification of U, without requiring external 
reference panels or genealogical information. Population structure inference on the entire set of 
sample individuals, J\f, is then obtained by first directly performing PCA on the selected ancestry 
representative subset, IA, and then predicting values along the components of variation for all 
individuals in the related subset, 7Z, based on genetic similarities with the individuals in IA. In the 
following subsections, we describe the PC-AiR method in detail. 

Population Genetic Modeling Assumptions 

The population genetic modeling assumptions we make are weak and are satisfied by commonly used 
models of population structure, such as the Balding-Nichols model (Balding and Nichols, 1995). 
The individuals in set M are assumed to have been sampled from a population with ancestry derived 
from K ancestral subpopulations. Let S be the set of autosomal SNPs in the genome-screen, and for 
SNP s e S, denote p s = (pi, . . . ,pf) T to be the vector of subpopulation-specific allele frequencies, 
where p 1 ^ is the allele frequency at SNP s in subpopulation k e {1, . . . , K}. We assume that the p$ 
are random variables that are independent across s but with possible dependence across the fc's, with 



Downloaded from http://biorxiv.org/on September 18, 2014 

6 

mean E[p s ] = p s l and covariance Cov[p s ] = p s {l—p s )JlK for every s, where 1 is a length K column 
vector of l's, and 'Sk is a K x K matrix. In genetic models incorporating population structure, 
the allele frequency parameter p s is typically interpreted as an "ancestral" allele frequency, or 
some average of allele frequencies across subpopulations. Although we allow to be completely 
general, including allowing for non-zero covariances across subpopulations, a special case is the 
Balding-Nichols model, where is a diagonal matrix with (k, k)-th element equal to F}. ^ 0, 
and .Ffc is Wright's standardized measure of variation (Wright, 1949) for subpopulation k. We 
allow for sample individuals to have admixed ancestry from the K subpopulations, and we denote 
an = (a}, . . . ,af) T to be the ancestry vector for individual i e Af, where a 1 ? is the proportion of 
ancestry across the autosomal chromosomes from subpopulation k for individual i, with ^ 0 for 
all k, and 2*Li a i = 1- ^ n mos t contexts, the parameters K, S^-, p s and p s for all s e S, and aj 
for all i e M will be unknown. The goal of PC-AiR is to obtain inference on ancestry, i.e. the a^'s, 
for all sample individuals i e M in the presence of known or cryptic relatedness. 

Relatedness Inference in Structured Populations 

PC-AiR uses kinship coefficients to measure relatedness between all pairs of individuals in A/", 
where the kinship coefficient for individuals i and j, which we denote as (frij, is defined to be the 
probability that a random allele selected from i and a random allele selected from j at a locus are 
identical- by-descent (IBD). When the genealogy of the sample individuals is known, PC-AiR can 
use theoretical or pedigree-based kinship coefficients, and a number of software packages (Abney, 
2009; Zheng and Bourgain, 2009) are available for calculating these according to a specified ge- 
nealogy. However, genealogical information on sample individual is often unknown, incomplete, or 
misspecified, and PC-AiR can also use empirical kinship coefficients estimated from genome-screen 
data for samples with cryptic relatedness that must be genetically inferred. It is important to note 
that relatedness estimators that assume population homogeneity, such as those implemented in the 
widely used PLINK software (Purcell et al., 2007) or obtained via a standard genetic relationship 
matrix (GRM)(Yang et al., 2010), are biased in samples from structured populations. Therefore, 
we do not recommended using these estimators with PC-AiR as it has been demonstrated that 
they give inflated kinship estimates in the presence of population structure (Thornton et al., 2012; 
Manichaikul et al., 2010), where (1) unrelated pairs of individuals with similar ancestry can have 



Downloaded from http://biorxiv.org/on September 18, 2014 

7 

kinship-coefficient estimates corresponding to values that are expected for close relatives, and (2) 
related individuals can have a systematic inflation in their estimated degree of relatedness. 

To use the PC-AiR method when pedigree relationships are unknown or incomplete, we rec- 
ommend using empirical kinship coefficient estimates from methods that have been developed for 
samples from structured populations. One such estimator is KING (kinship-based inference for 
GWASs)-robust (Manichaikul et al., 2010). Rather than using estimated allele frequencies, which 
leads to biased relatedness estimates in the presence of population structure, KING-robust relies 
on shared genotype counts across the SNPs in the genome-screen to measure the genetic distance 
between individuals. KING-robust was developed for relatedness inference in samples from popula- 
tions with discrete substructure without admixture, and it is a consistent estimator of the kinship 
coefficient for a pair of outbred individuals from the same subpopulation. The estimator, however, 
will generally be negatively biased for pairs of individuals that have different ancestries. Despite 
this bias, the KING-robust estimator is typically able to separate close relatives with similar an- 
cestry from unrelated individuals, which is often sufficient for the PC-AiR method. Additionally, 
the PC-AiR method exploits the negative bias of the KING-robust estimator to gain insight on 
ancestry differences among individuals, as discussed in more detail in the following subsection. 

Estimated kinship coefficients from the recently proposed REAP (Thornton et al., 2012) and 
RelateAdmix (Moltke and Albrechtsen, 2014) methods can also be used by PC-AiR. Both of these 
methods offer improved relatedness inference over KING-robust in samples with admixed ancestry 
by using external reference panels. REAP and RelateAdmix, however, may not be suitable for some 
studies as they require (1) some prior knowledge about the ancestries that are likely present in the 
sample, and (2) appropriate reference panels with suitable surrogates for the ancestral subpopula- 
tions. KING-robust does not require external reference panels and can be used with PC-AiR for 
admixed samples with cryptic relatedness when the REAP and RelateAdmix methods may not be 
practical. 

Measuring Ancestry Divergence with Genome-Screen Data 

Pairwise measures of relatedness, such as kinship coefficients, among individuals in a sample can 
be used for selecting a subset of mutually unrelated individuals (Staples et al., 2013). In structured 
samples, however, identifying a subset of unrelated individuals based solely on relatedness measures 
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can result in a subset that lacks sufficient diversity for population structure inference on the entire 
sample, as it may not be representative of the ancestries of all individuals. For the identification of 
an ancestry representative subset of mutually unrelated individuals, PC-AiR incorporates measures 
of ancestry divergence in addition to the kinship coefficients used as measures of relatedness. 

Consider a pair of individuals i, j e M who have non- missing genotype data at the set c S 
of autosomal SNPs in a genome-screen, and let \Sij\ denote the total number of SNPs in this set. 
Additionally, let the random variables gi s and gj s be the number of copies of the reference allele 
that individuals i and j each have, respectively, at SNP s e Sij] thus, gi s and gj s take values of 0, 
1, or 2. To measure ancestry divergence between a pair of unrelated individuals i and j, we use the 
estimator 

1 / Y.^c .(a™ - Pi*) 2 \ 

(1) 



EseSij ( 1 [ff i a=l] + 1 fea=l]J, 
where l[ 5 . s=1 ] is an indicator for individual i being heterozygous at SNP s, i.e. lr g . s= i] is 1 if gi s = 1 
and is 0 otherwise, and . j= ]j is similarly defined for individual j. Equation (1) is equivalent to the 
KING-robust estimator (Manichaikul et al., 2010) that has been proposed for estimating kinship 
coefficients of related individuals in samples from discrete subpopulations. We consider the KING- 
robust estimator under the general population genetic modeling assumptions previously discussed 
for i and j with admixed ancestry from K ancestral subpopulations. Recall that a^ and aj are the 
ancestry vectors for i and j, respectively. Under an assumption that genotypes at different SNPs 
are independent and with \Sij\ —* oo, it can be shown (see the Appendix) that 

~ _ -a i ) T S x (a i - aj ) 



For unrelated i and j with the same ancestry, Kij — > 0, as can be seen from Equation (2) by setting 
a, = aj. However, when i and j have different ancestral backgrounds, Kij is a negatively biased 
estimator of kinship, and this bias provides a useful measure of the ancestry divergence between 
the pair of individuals. Consider, for example, the Balding-Nichols model where is a diagonal 
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matrix with (k, k)-th element equal to F k . Under this model, it can be seen from Equation (2) that 

-\t{<-a)?F k 

" ~ ' -. (3) 



K 

fc=l 



K*) 2 + (a*) 2 



The value in Equation (3) is negative when % and j have different ancestry proportions, and the 
magnitude of this negative value will depend on how divergent the ancestries for the pair are. 
The Kij estimator will generally have more extreme negative values when (1) the F k values are 
large, (2) i and j have large ancestry proportion differences, or (3) either i or j has an ancestry 
proportion that is close to 1 from one of the K subpopulations. For the special case when i and j 
are non-admixed and have ancestry from different subpopulations k and k', the estimator reaches 
an extreme negative value and 

-\{F k + F k ,) 



v l-\{F k + F k ,)' 

PC-AiR uses the estimator given by Equation (1) for inference on ancestry divergence for all 
pairs of individuals i,j e N who are not inferred to be related based on the kinship coefficient 
measures discussed in the previous subsection. 

Identification of an Ancestry Representative Subset 

We now provide details on how PC-AiR uses both the relatedness and ancestry divergence measures 
discussed in the previous two subsections for the identification of U, a mutually unrelated subset 
of individuals that is representative of the ancestries of all individuals in the sample N '. Let cftij 
be the kinship coefficient measure that is chosen for relatedness inference on a pair of individuals 
i,j e M. When the genealogy of the sample individuals is known, <fiij could be a pedigree-based 
kinship coefficient, and when the genealogy is partially or completely unknown, cf)^ would be 
an empirical kinship coefficient estimate from a relatedness estimation method that allows for 
population structure, e.g., the KING-robust estimator of Equation (1) , REAP, or RelateAdmix. 
In order to identify all pairs of relatives in jV, a relatedness threshold, r^, is chosen such that i and 
j are designated to be related by the PC-AiR method if faj > t$. When pedigree-based kinship 
coefficients are used with PC-AiR, all unrelated pairs will have 4>ij = 0, and should be set to 0. 
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When empirical kinship coefficient estimates are used, there will be some noise in the estimation, 
and can be set to an approximate upper bound that is expected for the chosen kinship coefficient 
estimator for an unrelated pair. For example, when using KING-robust for relatedness inference, 
i.e. using (ftij = Kij, we have found that 0.025 is an approximate upper bound with dense SNP 
genotyping data for unrelated pairs with the same ancestry, and setting = 0.025 works well in 
practice for identifying relatives with similar ancestry up to third-degree (and some fourth-degree) 
in a variety of population structure settings with ancestry admixture. For all sample individuals 
i e M, we calculate 7^ = Yjj^i ^ij^ty- > T ] as a measure °f the total kinship individual i has with 
its inferred relatives in the sample, where 1^.. >T j is the indicator that individual j is inferred to 
be individual i's relative. 

PC-AiR infers ancestry divergence using k^ for all pairs of individuals i,j e J\f who are not 
inferred to be relatives. We showed that consistent estimator of 0 for unrelated pairs 

with the same ancestry, while unrelated pairs with different ancestry will have Kij values that are 
systematically negative. We define a pair of individuals i and j to be "divergent" if they have 
different ancestral backgrounds, i.e. Kij < —t k , where — r K is the expected lower bound of Kij for 
a pair of unrelated individuals with the same ancestry. Since the distribution of Kij for unrelated 
pairs with the same ancestry will be symmetric around 0, we expect that the vast majority of these 
pairs will satisfy Kij e [—0.025,0.025] when \Sij\ is large, where 0.025 is the previously mentioned 
approximate upper bound for unrelated pairs. We have found that setting — t k = —0.025 works 
well in practice for identifying unrelated pairs of individuals with different admixed ancestries. For 
all sample individuals i e J\f, we calculate Si = Y!,-j,-l r 7 . - . n , the number of divergent 
ancestry pairs that individual % is a member of. Small Si values generally correspond to individuals 
with ancestry that is similar to the ancestries of many other individuals in J\f, while the highest Si 
values generally correspond to individuals with unique ancestry and/or individuals with an ancestry 
proportion close to 1 from some subpopulation. Individuals with the highest Si values are given 
priority for inclusion in IA as they help to ensure that the subset is both diverse and representative 
of all ancestries in the sample. 

The algorithm used by PC-AiR for partitioning the set N into subsets IA and 1Z based on 
measures of ancestry divergence and kinship is presented below: 
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(1) Compute: ji = £ ^i 1 ^^] for a11 * E ^ ■ 

jejV 

(2) Compute: 6 t = £ 1 [<? ~ T(J for all i e TV. 

(3) Initialize the two subsets to be U = j\f and TZ = 0, where 0 is the empty set. 



(4) Compute: rji 



^ [4>ii>T</,] 



0 Vi e 7e 

If max(i]i) > 0, continue to step (5), otherwise go to step (11). 

i 

(5) Identify 71 = = max(r/j)}, the subset of individuals in U with the most relatives in U. 

j 

If 1 71 1 > 1, where \T\\ is the number of elements in 71, go to step (6). Otherwise set 7~* = 71 
and go to step (9). 

(6) Identify 7i = = min(<5,-)}, the subset of individuals in 71 that are members of the least 
divergent ancestry pairs. 

If 1 7i | > 1, go to step (7). Otherwise set 7~* = Ti and go to step (9). 

(7) Identify 73 = = min(7,)}, the subset of individuals in 7i that have the minimum total 
kinship with their inferred relatives. 

If 1 75 1 > 1, go to step (8). Otherwise set T* = 7-j and go to step (9). 

(8) Randomly select one element from 7% and define this element to be the set T* ■ 

(9) Define the sets: W = U\T* and TZ* = 1Z u 7~*. 

(10) Update U =U* and TZ = TZ* and return to step (4). 

(11) The algorithm has completed. 

This algorithm is both fast and efficient, and the two subsets returned from the algorithm are the 
ancestry representative and mutually unrelated subset, U, and the related subset, TZ, where each 
individual in TZ has at least one relative in IA. The algorithm is constructed in such a way that for 
any subset of mutually related individuals in j\f, one individual in the subset will be chosen to be 
included in U, with priority given to the individual who is a member of the most divergent ancestry 
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pairs. This helps to ensure that every ancestry in N is represented by some individual(s) in 14, while 
simultaneously satisfying the requirement that individuals in IA are also mutually unrelated. It also 
favors selecting individuals with the highest ancestry proportions from each of the K subpopulations 
for U, which helps to avoid shrinkage in prediction of principal component values for individuals 
in 1Z, as these individuals will be at the extremes of the K — 1 dimensional space spanned by the 
axes of variation representing the ancestries in J\f. Secondary priority for inclusion in U is given 
to individuals that share the most genetic information with their collection of relatives in A/", also 
allowing for better prediction of principal component values for relatives in 1Z. 

Genetic Similarity Matrix for PC-AiR 

The traditional PC A approach for population structure inference with genetic data, e.g., the 
EIGENSOFT method, performs PCA on standardized genotypes, where the standardized geno- 
type value for individual % at SNP s is given by 



and p s will typically be an allele frequency estimate for SNP s calculated using all sample indi- 
viduals. The PC-AiR method also uses standardized genotypes, but the allele frequencies used 
for the standardization are calculated using only the unrelated individuals selected for IA. The 
standardized genotype values for PC-AiR are calculated from Equation (5) by setting p s = p%, 
where 



IA S is the subset of individuals in U who have non-missing genotype data at SNP s, and \U S \ is 
the number of individuals in U s . In samples with related individuals and population structure, we 
have found that using the estimator p" provides better ancestry inference with PC-AiR than using 
allele frequency estimates calculated from the entire sample, which can be heavily influenced by 
the correlated genotypes among relatives. For any individual % e M with a missing genotype value 
at SNP s, Zi s is set to 0, i.e. gi s is set equal to 2p", an estimate of its expected value. 

Similar minor allele frequency filtering and LD pruning of SNPs that have been recommended 




s 



(6) 
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for standard PCA (Price et al., 2006; Patterson et al., 2006) should also be used for PC-AiR. Let 
|<S*| be the number of SNPs in the pruned and filtered set S* , and let n, n u , and n r be the number 
of individuals in set M and subsets IA and 7Z, respectively, with n = n u + n r . We construct Z, an 
n x \S* \ standardized genotype matrix for M, with (i, s)-th entry equal to Z{ s , ordered such that the 
first n u rows correspond to individuals in IA, and the remaining n r rows correspond to individuals 
in 1Z. The standardized genotype matrix for IA is the n u x submatrix Z u corresponding to the 
first n u rows of Z. Similarly, the n r x submatrix Z r is the standardized genotype matrix for 1Z 
corresponding to the last n r rows of Z. 

Similar to the traditional PCA approach, PC-AiR obtains a genetic similarity matrix (GSM) 
for population structure inference from standardized genotypes. It is important to note that PCA 
applied to a GSM that includes all individuals in M, as in the traditional PCA approaches, leads 
to artifactual principal components for ancestry due to confounding from correlated genotypes 
among relatives, i.e. genetic similarities are reflecting alleles shared IBD among relatives. To 
protect against confounding caused by sample relatedness, PC-AiR instead calculates a GSM using 
only the mutually unrelated sample individuals who were selected to be included in the ancestry 
representative subset, IA. The empirical n u x n u GSM for U calculated with the standardized 
genotype matrix Z u is 

= ^Z-uZL (7) 

and the (i,j)-th entry of ty u provides a measure of the average genetic similarity across the auto- 
somes for individuals i,jeU. 

Population Structure Inference in Related Samples with PC-AiR 

To obtain principal components that are ancestry representative on a set N containing related 
individuals, the PC-AiR method first performs a PCA using genome-screen data from only those 
individuals selected to be in the mutually unrelated ancestry representative subset, IA. PCA is 
performed by obtaining the eigendecomposition of the GSM ^f u from Equation (7). This procedure 
sequentially identifies orthogonal axes of variation, i.e. linear combinations of SNPs, that best 
explain the genotypic variability amongst the individuals in U, where each axis of variation reflects 
the structure that leads to the greatest variability after accounting for the structure explained by 
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all previously defined axes. Trie eigendecomposition of results in an ti u x ti u matrix, V w — 
[V", V2 , • • • , ], with orthogonal, length n u , column vectors, and a corresponding length n u 
vector, A n = (A^ , X^,- - ■ , A^J, with the property of A" > A| > . . . > X% u - For d e {1, . . . , n u }, 
and A^ are the corresponding d th principal component (eigenvector) and eigenvalue of where 
X^ is proportional to the percentage of variability in the genome-screen data for U that is explained 
by V^. By construction, individuals in U are mutually unrelated and have diverse ancestry, so the 
top principal components of Vl/ U are expected to be representative of ancestry. 

Once PCA has been performed on U, principal components values for individuals in the related 
subset, 1Z, can be obtained via prediction. Let \- u be a diagonal matrix created from the vector of 
eigenvalues, i.e. L u = diag(A u ). An \S*\ x n u SNP weight matrix giving the relative influence of 
each SNP on each of the n u eigenvectors can be obtained as W u = ZjV u , and from the form of the 
eigendecomposition of the real symmetric matrix ty u = V^L^N/^ 1 , it can be shown (Heath et al., 
2008) that the principal components for the ancestry representative subset, U, can alternatively be 
written as: 



(8) 



For the related subset, 1Z, the PC-AiR method predicts principal components values from Equa- 
tion (8) by replacing Z u , the standardized genotype matrix for individuals in U, with Z r , the 
standardized genotype matrix for individuals in 1Z. The matrix of predicted eigenvectors 

for 7Z, which we denote as Q r , is thus given by: 



(9) 



The d th column in the matrix Q r corresponds to PC-AiR's predicted coordinates along the d th 
principal component for the individuals in 1Z. We define T to be the nx n u matrix of the combined 
principal components for IA and 1Z, where: 



Vu 






V! ; • 


1 y« 

• 1 v n 


Qr 






Q r 2 ■ 


■ q; 



(10) 
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The column vectors of T are the principal components (axes of variation) of the set N = U u 1Z 
obtained from the PC-AiR method. The genetic structure that is reflected by all of the principal 
components for PC-AiR are found using only the ancestry representative subset, U, and thus the 
top principal components from T are designed to be representative of ancestry in Af, even in the 
presence of known or cryptic relatedness. 

Simulation Studies 

We perform simulation studies in which both population and pedigree structure are simultaneously 
present in order to (1) assess the accuracy and robustness of the PC-AiR method for population 
structure inference in the presence of relatedness, (2) evaluate correction for population stratifi- 
cation with PC-AiR in genetic association studies with cryptic structure, and (3) compare the 
performance of PC-AiR to existing methods. We simulate a variety of population structure set- 
tings, including admixture and ancestry-related assortative mating, with differentiation between 
populations ranging from subtle to large. We evaluate population structure inference for four 
different relationship configurations, where each configuration corresponds to a specific setting of 
genealogical relationships among the sample individuals. In all simulation studies considered, pedi- 
gree information on the sample individuals is hidden and genetic relatedness is inferred from the 
genotype data with the PC-AiR method using the KING-robust kinship estimator in Equation (1). 

Population Structure Settings 

The population structure settings we consider are similar to the settings in Price et al. (2006), 
where PCA was performed with the EIGENSOFT software in unrelated samples for inference on 
and adjustment for population structure in GWAS, except that our simulation studies include 
related individuals. We consider population structure settings where individuals have ancestry 
derived from two populations, and the allele frequencies at 100,000 SNPs for each of these two 
populations are generated using the Balding-Nichols model (Balding and Nichols, 1995). More 
precisely, for each SNP s, the allele frequency p s in the ancestral population is drawn from a 
uniform distribution on [0.1, 0.9], and the allele frequency in population k e {1,2} is drawn from a 
beta distribution with parameters p s (l — Fk)/Fk and (1 — p s )(l — F^/Fk, where the quantity Fk 
is equivalent to Wright's measure of population differentiation (Wright, 1949) from the ancestral 



Downloaded from http://biorxiv.org/on September 18, 2014 

16 

population. In all simulations, we set F\ and Fi equal to a common value, F$t- To generate allele 
frequencies derived from populations ranging from closely related to highly divergent, we consider 
Fst values from 0.01 to 0.2. 

For each F$t value considered, we simulate three population structure settings. Population 
structures I and II both consist of individuals sampled from an admixed population formed from 
populations 1 and 2. For population structure I, all unrelated individuals and pedigree founders 
have ancestry proportions a from population 1 and (1 — a) from population 2, with the parameter 
a for each individual drawn from a uniform distribution on [0, 1]. Population structure II is similar 
to population structure I, but with the ancestry parameter, a, drawn from a beta distribution with 
mean 0.4 and standard deviation 0.1 for 50% of the unrelated individuals and pedigree founders, 
and with mean 0.6 and standard deviation 0.1 for the other 50%. All founders within the same 
pedigree have a drawn from the same beta distribution for population structure II. Population 
structure III consists of non-admixed individuals, where 50% of the unrelated individuals and 
pedigrees are sampled from population 1, and the other 50% are sampled from population 2. Both 
population structure settings II and III have ancestry-related assortative mating, i.e., the mating of 
founder individuals in every pedigree occurs with individuals who have either the same (population 
structure III) or similar (population structure II) ancestry, while population structure I has random 
mating that is independent of ancestry. 

Relationship Configurations 

Three of the four relationship configurations simulated include both related and unrelated individ- 
uals. Relationship configuration I consists of 200 unrelated individuals and 200 individuals from 10 
four-generation pedigrees, where each pedigree has a total of 20 individuals (Figure SI). Relation- 
ship configuration II is comprised of 280 unrelated individuals with 20 parent-offspring trios, and 
relationship configuration III includes 260 unrelated individuals with 20 sibling pairs. To sample 
pedigree relationships within a given setting of population structure, we simulate genotypes for 
pedigree founders under Hardy- Weinberg equilibrium (HWE) according to the chosen population 
structure setting and then drop alleles down the pedigree. Relationship configuration IV consists 
of 320 unrelated individuals without any family structure. We include the unrelated sample setting 
in our simulation studies in order to evaluate any loss in population structure inference with the 
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PC-AiR method compared to standard PCA in a setting where standard PCA is appropriate and 
has been previously demonstrated to perform well. 

Results 

Subtle Population and Pedigree Structure 

We first considered samples with subtle population structure, where the ancestry of the sample 
individuals was derived from two closely related populations. We set F$t to 0.01 (a typical value 
for divergent European populations) and generated genotype data under population structure I 
for each of the four relationship configurations. Population structure inference with PC-AiR was 
compared to that of standard PCA with the EIGENSOFT software. To assess the performance of 
the two methods, we included the top principal components (axes of variation) from each method as 
predictors for the true simulated ancestry of the sample individuals in a linear regression model, and 
the proportion of ancestry explained, as measured by R 2 , was used to evaluate prediction accuracy. 
We also compared the efficiency of PC-AiR to EIGENSOFT by assessing the number of top axes 
of variation required to attain an R 2 of at least 0.99 for ancestry. It should be noted that since 
the data in the simulation studies contained only one added dimension of population structure, 
an optimal method would require only a single axis of variation for complete ancestry inference. 
Both PC-AiR and EIGENSOFT were provided only genotype data without any additional pedigree 
information on the sample individuals. 

Figure 1 displays the population structure inference results for relationship configuration I from 
both PC-AiR and EIGENSOFT. Figure IB displays the top two axes of variation obtained by 
EIGENSOFT, which almost entirely reflected pedigree structure in the sample. The ten spikes 
of points radiating from the center cluster in the figure correspond to the individuals who are 
members of the ten pedigrees, and the cluster of points in the center of the plot corresponds to 
the 200 individuals who do not have any relatives in the sample. In contrast, the top two axes 
of variation from PC-AiR were not confounded by family structure, as illustrated in Figure 1A, 
and the top axis explained ancestry in the sample nearly perfectly, with an R 2 of 0.993 (Figure 
1C). Figure ID shows that the top axis of variation from EIGENSOFT did not reflect population 
structure and did not adequately capture the ancestry of the sample individuals, with an R 2 of only 
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0.133. The efficiency for population structure inference of both methods is illustrated in Figure 
IE, where the proportion of ancestry explained (R 2 values) for each of the top axes of variation is 
displayed. EIGENSOFT required the top 51 axes to be included as predictors in a linear regression 
model to achieve an R 2 of at least 0.99 for ancestry. In contrast, a single axis of variation from PC- 
AiR had an R 2 greater than 0.99, thus demonstrating a substantial improvement in efficiency with 
PC-AiR over EIGENSOFT in this setting with both subtle population structure and relatedness. 

Population structure inference results with PC-AiR and EIGENSOFT for relationship configu- 
rations II and III are presented in Figures S2 and S3. The top axes of variation from EIGENSOFT 
were influenced by relatedness, as expected; however, since relationship configurations II and III 
have substantially less pedigree structure than relationship configuration I, there was some improve- 
ment in ancestry prediction with the top axis in each of these two settings, with R 2 values of 0.870 
and 0.933, respectively. For both relationship configurations II and III, the top 21 axes of variation 
from EIGENSOFT were required to attain an R 2 of at least 0.99 for predicting ancestry. In com- 
parison, the PC-AiR analysis was robust to the relatedness in the sample, and the single top axis 
of variation for both relationship configurations II and III attained an R 2 value greater than 0.99 
for predicting ancestry. For relationship configuration IV, PC-AiR accurately identified all sample 
individuals to be unrelated, i.e. the ancestry informative subset, U, was the entire sample, jV, so 
the PC-AiR method reduced to standard PCA, and inference with either PC-AiR or EIGENSOFT 
was essentially identical. The R 2 for ancestry with the top axis of variation from both methods was 
greater than 0.99, illustrating that there is no loss in accuracy or efficiency compared to standard 
PCA when using PC-AiR for population structure inference in samples where all individuals are 
unrelated. 

We also evaluated the performance of PC-AiR and EIGENSOFT under population structures 
II and III with F$t set to 0.01 for each of the relationship configurations. The results are given in 
Table 1, and the conclusions drawn from these population structure settings are the same as those 
for population structure I. For the three relationship configurations that included related samples, 
a single axis of variation from PC-AiR fully explained the ancestry in the sample and provided 
better prediction of ancestry than using ten (or more) axes from EIGENSOFT. For relationship 
configuration IV, where all sample individuals were unrelated, PC-AiR and EIGENSOFT gave 
essentially identical results, with the top axis from both methods fully explaining the true ancestry. 
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Relatedness and Admixture from Divergent Populations 

We also conducted simulation studies with relatedness and admixture from divergent populations. 
We considered relationship configuration I and population structure II, where we set F$t to 0.1 
(a value representative of continental-level ancestry differences) in the Balding-Nichols model to 
simulate allele frequencies at SNPs derived from two divergent populations. We evaluated and 
compared the performance of PC-AiR to PCA with the EIGENSOFT software, MDS with the 
PLINK software, and the two model-based methods ADMIXTURE and FRAPPE for proportional 
ancestry estimation. As in the previous subsection, no genealogical information on the sample 
individuals was provided to any of the analysis methods, so the FamPCA method could not be used 
as it is restricted to settings with known pedigrees. The ADMIXTURE and FRAPPE software 
analyses were conducted with the correct number of populations specified. 

The population structure inference results for each method considered are shown in Figure 2, 
where each panel is a plot of the simulated population 1 ancestry proportions against the inferred 
ancestry from one of the methods. The top axis of variation from PC-AiR had an R 2 of 0.998 and 
provided nearly perfect inference on ancestry for the sample individuals (Figure 2A). Similar to the 
EIGENSOFT results for the simulations with subtle population structure and relatedness, the top 
axis of variation did not adequately reflect the ancestry in this related sample with admixture from 
divergent populations, attaining an R 2 of only 0.741 (Figure 2B). ADMIXTURE and FRAPPE 
gave identical ancestry proportion estimates for all individuals in the simulation, and Figure 2D 
shows estimated proportional ancestry plotted against the simulated ancestry proportions from 
population 1. These model-based ancestry estimation methods were confounded by the pedigree 
structure in the sample and performed similarly to PCA, with an R 2 of only 0.730. While the top 
dimension of MDS achieved an R 2 of 0.785 and provided some improvement in predicting ancestry 
over both of the model-based methods as well as the top axis of variation from EIGENSOFT, it 
was also confounded by sample relatedness, as shown in Figure 2C. 

We also evaluated the performance of PC-AiR and EIGENSOFT for all combinations of rela- 
tionship configurations and population structure settings with F$t set to 0.1 and 0.2 (Table 1). 
For all settings considered, the top axis of variation from PC-AiR gave nearly perfect ancestry in- 
ference, attaining an R 2 > 0.99. The extent to which EIGENSOFT's PCA was confounded by the 
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relatedness depended on how divergent the populations were, i.e. the F$t values, and how complex 
the pedigree structure was; however, a single axis of variation from PC-AiR always performed as 
well as or better than using ten axes of variation from EIGENSOFT for ancestry prediction. 

Ancestry Inference in Related Samples with Reference Panels 

Reference population panels are commonly used for improved ancestry inference in unrelated sam- 
ples from admixed populations, such as African Americans and Hispanics. We conducted a sim- 
ulation study evaluating population structure inference with reference panels in admixed samples 
with relatedness. We considered the same simulation study discussed in detail in the previous 
subsection, but we now included reference panels consisting of 50 unrelated individuals randomly 
sampled from each of the two populations. The same population structure methods from the pre- 
vious subsection were used, and the results are displayed in Figure S4. Ancestry inference with 
EIGENSOFT, MDS, ADMIXTURE, and FRAPPE was substantially improved by including the 
reference panels as compared to the analyses without them, but PC-AiR still outperformed all 
methods, with the top axis of variation achieving an R 2 of 0.999 with ancestry. The supervised 
analyses with ADMIXTURE and FRAPPE including the reference panels gave identical results to 
each other, as in the unsupervised analyses without reference panels, and the estimated ancestry 
proportions had an R 2 of 0.973 with the simulated ancestries. Similarly, the top axis of variation 
from each of EIGENSOFT and MDS reached R 2 values of 0.970 and 0.979 respectively. 

Interestingly, the top axis of variation from PC-AiR without additional reference population 
samples had an R 2 of 0.998 and provided better ancestry prediction than all of the competing 
methods with the reference panels. Even with the inclusion of reference panels, there remains some 
bias in ancestry inference for all methods, except for PC-AiR, that is induced by the presence of 
related individuals in the sample. This can be seen in Figure S4, where the inferred ancestries 
for individuals with relatives in the sample were systematically biased for each of the competing 
methods. We have found that using ADMIXTURE (or FRAPPE) to conduct separate individual 
ancestry analyses for each of the admixed sample individuals with the reference panels can remove 
the bias caused by sample relatedness, known or cryptic, as long as the reference panel samples are 
appropriate surrogates for the underlying populations. We performed separate individual ancestry 
analyses with ADMIXTURE for each sample individual, where each analysis included genotype 
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data from a single admixed sample individual and all individuals in the two reference population 
panels, and the estimated ancestries attained an R 2 of 0.999, the same as PC-AiR. 

Correcting for Structure in Genetic Association Studies 

We also performed simulation studies to compare population structure correction in genetic asso- 
ciation studies with PC-AiR to existing approaches. Allele frequencies were generated at 100,000 
null SNPs for two ancestral populations with F$t set to 0.1. We define D s = \p\ — p 2 \ to be the 
absolute difference in the reference allele frequencies between ancestral populations 1 and 2 at SNP 
s. We also define three classes of SNPs based on D s , where SNPs with D s < 0.2, 0.2 < D s < .4, 
and D s ^ 0.4 were considered to have "low differentiation," "moderate differentiation," and "high 
differentiation," respectively. Of these 100,000 SNPs, approximately 70% had low differentiation, 
25% were moderate, and 5% were highly differentiated. Genotype data was generated under pop- 
ulation structure II for sample individuals related according to relationship configuration I, and 
for each individual i in the sample, a quantitative trait value m was simulated according to the 
model Ui = g% + 2aj + £j, where aj is the genome-wide ancestry proportion from population 1 for 
individual i, gi is the number of alleles individual i has at the causal SNP, and ei ~ N(0, 1) is a 
random environmental effect assumed to be acting independently on individuals. The frequency of 
the selected casual variant in populations 1 and 2 was 0.13 and 0.17, respectively. 

The following statistical methods were evaluated for genetic association testing: (1) linear 
regression without ancestry adjustment, (2) EIGENSTRAT, (3) linear regression with principal 
components from PC-AiR included as fixed effects, (4) GEMMA (Zhou and Stephens, 2012) and 
EMMAX (Kang et al., 2010), which are "exact" and "approximate" linear mixed effects model 
methods, respectively, that use an empirical genetic relatedness matrix to capture both population 
and pedigree sample structure, and (5) EMMAX with principal components from EIGENSOFT or 
PC-AiR included as fixed effects. For the association analyses, each null SNP was included as a 
fixed effect in the statistical models and was tested for association with the simulated quantitative 
trait. The genomic control inflation factor (Devlin and Roeder, 1999) Xqc was used to evaluate con- 
founding due to unaccounted for sample structure, where Xqc * 1 indicates appropriate correction 
for population and family structure, while Xqc > 1 indicates elevated type-I error. 

The results of the simulations are given in Table 2. As expected, all of the association tests using 
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linear regression models have inflated type 1 error since these methods either (1) do not account for 
any of the sample structure, or (2) account for population structure but not relatedness. Including 
a single principal component from PC-AiR in the linear regression model results in a lower Xqq 
compared to EIGENSTRAT with the top ten principal components for all classes of SNPs. This is 
because the top PC from PC-AiR nearly perfectly explains ancestry (R 2 = 0.998), while the top 10 
PCs from EIGENSOFT have an R 2 of only 0.672 for ancestry clS cl result of the relatedness in the 
sample. The mixed model approaches considered, EMMAX and GEMMA, are also not properly 
calibrated, with \qc > 1 for SNPs with moderate to high differentiation, and Xgc < 1 f° r SNPs 
with low differentiation. EMMAX with the top ten PCs from EIGENSOFT included as fixed effects 
is still not properly calibrated due to incomplete correction for population stratification. However, 
including a single PC from PC-AiR as a fixed effect with EMMAX results in appropriate calibration 
of the association test statistics, with Xqc = 1 for all classes of SNPs. 

Population Structure Inference in Admixed HapMap Samples 
HapMap MXL Data 

We analyzed high-density genotype data from the Mexican Americans in Los Angeles, California 
(MXL) population sample of HapMap 3 for population structure inference. We applied PC-AiR, 
EIGENSOFT, MDS, ADMIXTURE, and FamPCA to the 86 genotyped individuals, and we com- 
pared the population structure inference results of these methods to a supervised individual ancestry 
estimation analysis with ADMIXTURE that included continental reference population panels. For 
the supervised analysis with ADMIXTURE, the number of ancestral populations was set to 3, 
for which the HapMap CEU (Utah residents with ancestry from northern and western Europe 
from the Centre d'Etude du Polymorphisme Human collection) and YRI (Yoruba in Ibadan, Nige- 
ria) samples were included as the reference population panels for European and African ancestry, 
respectively, and for which the Human Genome Diversity Project (HGDP) (Li et al., 2008) sam- 
ples from the Americas were included for Native American ancestry. The analyses were based 
on 150,872 autosomal SNPs that were genotyped in both the HapMap and HGDP datasets. To 
protect against potential confounding due to relatedness in the supervised ancestry analysis, a sep- 
arate ADMIXTURE analysis was conducted for each of the HapMap MXL individuals, where each 
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analysis included a single HapMap MXL individual and the reference population panels. All meth- 
ods, except for FamPCA, were only provided the SNP genotype data on the sample individuals for 
population structure inference, without any additional information on the pedigree relationships. 
The FamPCA method was also provided the documented pedigrees in the HapMap MXL which 
includes 24 genotyped trios, 5 families with two genotyped individuals, and 4 families with a single 
genotyped individual. The PC-AiR method used the KING-robust kinship coefficient estimator in 
Equation (1) and the relatedness threshold = 0.025 to infer genetic relatedness in the sample, 
and a MAF filter of 5% was used on SNPs for population structure inference. 

Figure 3F presents a bar plot of the results from the supervised individual ADMIXTURE 
ancestry analysis. In the bar plot of ancestry proportion estimates, individuals (vertical bars) are 
arranged in increasing order (left to right) of genome-wide European ancestry proportion. Our 
proportional ancestry estimates were similar to the results from a previous supervised analysis of 
this data (Thornton et al., 2012; Gravel et al., 2013). HapMap MXL individuals have modest 
African ancestry with little variation, with a mean of 6% and a standard deviation (SD) of 1.8%. 
The sample individuals are largely derived from European and Native American ancestry, with 
means of 49.9% (SD=14.8%) and 44.1% (SD=14.8%) respectively. Since the European and Native 
American ancestry proportions are predominant, nearly perfectly negatively correlated (with a 
correlation of -0.99), and quite variable, ranging from 18.0% to 91.0% and from 4.2% to 80.4% 
respectively, we expected that an optimal population structure inference method would require 
only a single axis of variation to explain these two ancestries in the HapMap MXL. 

The population structure inference results for European and Native American ancestry in the 
HapMap MXL are given in Table 3. PC-AiR's top axis of variation was nearly perfectly correlated 
with European (and Native American) ancestry, as estimated from the supervised individual AD- 
MIXTURE ancestry analysis, with an R 2 of 0.98 (Figure 3A). In contrast, the top axis of variation 
from each of EIGENSOFT, FamPCA, and MDS had an R 2 for European ancestry of only 0.66, 
0.65, and 0.75 respectively. For the unsupervised ADMIXTURE analysis that did not include ref- 
erence panels, the highest R 2 for either European or Native American ancestry with any estimated 
ancestry component was only 0.64. Figures 3B, 3C, 3D, and 3E illustrate that ancestry inference in 
the HapMap MXL for each of these competing methods was confounded by relatedness, including 
the FamPCA method, which was provided the documented pedigree relationships. Ancestry infer- 
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ence with FamPCA was confounded by cryptic relatedness present in the HapMap MXL including 
a previously reported (Thornton et al., 2012) extended pedigree consisting of two smaller docu- 
mented pedigrees, which we have labeled in Figure 3 as MXL Extended Family 1. Without being 
provided any pedigree information, a single axis of variation from PC-AiR gave better prediction 
of both European and Native American ancestry than the top ten axes from EIGENSOFT, MDS, 
and FamPCA, as shown in Table 3. Remarkably, the top axis of variation from PC-AiR without 
using any reference population samples gave comparable ancestry inference on European and Na- 
tive American ancestry to a supervised ancestry analysis that included reference panels, similar to 
the results from the simulation studies. 

Combined HapMap ASW and MXL Data 

To evaluate the performance of the population structure inference methods in an admixed popula- 
tion structure setting with three predominant continental ancestries and relatedness, we considered 
an analysis of the combined HapMap ASW (African American individuals in the southwestern 
USA) and MXL samples. Similar to our ancestry estimation analysis of the HapMap MXL, we 
also conducted a supervised individual ADMIXTURE analysis for the 87 genotyped individuals 
in the HapMap ASW with reference population panels included for European, Native American, 
and African ancestries. Figure S5A shows a barplot of the results from the supervised individual 
ADMIXTURE ancestry analysis of the HapMap MXL and ASW samples, which illustrates that 
these populations have very different ancestral backgrounds. Most of the HapMap ASW ancestry 
is African, with a mean of 77.5% (SD=8.4%). There is also a large European ancestry component, 
with a mean of 20.5% (SD=7.9%); however, unlike the HapMap MXL, there is very little Native 
American ancestry in the HapMap ASW, with a mean of only 1.9% (SD=3.5%). Since there are 
three predominant continental ancestries in the combined HapMap ASW and MXL samples, we 
expected that an optimal method would require two axes of variation to fully explain continental 
population structure. 

We applied each of the dimension reduction methods (i.e. PC-AiR, EIGENSOFT, MDS, and 
FamPCA) to the combined HapMap ASW and MXL samples and compared the results to the 
supervised individual ancestry analysis with ADMIXTURE that included the reference population 
panels; results are shown in Table 3. All of the methods were able to fully explain the African 
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ancestry with two axes of variation, achieving R 2 values greater than 0.99. For European ancestry, 
PC-AiR's top two axes of variation achieved an R 2 value of 0.99, while the top two axes from each 
of the competing population structure methods had R 2 values less than 0.90. With an R 2 value 
greater than 0.99, PC-AiR's top two axes of variation also explained Native American ancestry 
better than the top two axes from EIGENSOFT, MDS, and FamPCA, with corresponding R 2 
values of 0.95, 0.96, and 0.95, respectively. These results are illustrated in Figure 4, where we can 
see that the top two axes of variation from each of these methods, except PC-AiR, were confounded 
by relatedness. In fact, the top ten axes of variation from EIGENSOFT, MDS, and FamPCA were 
highly confounded by pedigree structure, whereas axes beyond the top two from PC-AiR did not 
represent any identifiable structure and appear to be noise (Figures S6 - S10). As a consequence, 
the top ten axes of variation from both EIGENSOFT and MDS were not able to explain European 
and Native American ancestry as well as the top two axes from PC-AiR. Intersetingly, FamPCA 
required ten axes of variation to match PC-AiR's top two, despite FamPCA being provided the 
documented pedigree information for both the HapMap MXL and ASW samples (Table 3). PC- 
AiR appropriately accounted for both the known and cryptic relatedness in the sample for optimal 
and efficient inference on ancestry with only two axes of variation. 

We also performed an unsupervised ancestry analysis with ADMIXTURE and FRAPPE without 
including reference panel samples and we compared the results to the supervised ADMIXTURE 
analysis. ADMIXTURE and FRAPPE performed identically to each other, as expected, and a 
barplot of the estimated ancestry proportions from the unsupervised ancestry analysis is given in 
Figure S5B. Two of the three components of ancestry essentially distinguish the ASW from the 
MXL samples, while the third was completely confounded by pedigree structure. This estimated 
ancestry components were able to attain an R 2 value of 0.99 for African ancestry, but the R 2 values 
were only 0.87 for Native American ancestry and 0.62 for European ancestry, thus performing the 
worst of all the methods for ancestry inference in the combined HapMap MXL and ASW samples. 

Assessment of Computation Time 

The computation time for PC-AiR depends on both the sample size and the number of markers 
being analyzed. To analyze a simulated sample of 800 individuals, where 400 individuals are from 
20 pedigrees and the remaining 400 individuals are unrelated, with 100K, 50K, and 20K SNPs 
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required 28.5s, 14.8s, and 6.3s, respectively, on a 2.5 GHz laptop with 8 GB memory. The PC- 
AiR analysis of the HapMap data with 150,872 SNPs required 1.8s for the MXL sample with 86 
individuals and 3.9s for the combined ASW and MXL sample with 173 individuals. All computation 
times refer to the time to run the PC-AiR algorithm, and do not include the time to estimate the 
measure of relatedness and divergence. The KING software implements a highly efficient algorithm 
for obtaining relatedness/divergence estimates, and evaluating millions of pairs of individuals in a 
sample can be conducted in a matter of minutes. 

Discussion 

Genetic ancestry inference has been motivated by a variety of applications in population genet- 
ics, genetic association studies, and other genomic research areas. Advancements in array-based 
genotyping technologies have largely facilitated the investigation of genetic diversity at remark- 
ably high levels of detail, and a variety of methods have been proposed for the identification of 
genetic ancestry differences among unrelated sample individuals using high-density genome-screen 
data. It is common, however, for genetic studies to have sample structure that is due to both 
population stratification and relatedness, and existing population structure inference methods can 
fail in related samples. We develop PC-AiR, a method for robust population structure inference 
in the presence of known or cryptic relatedness. PC-AiR applies a computationally efficient al- 
gorithm that uses pairwise measures of kinship and ancestry divergence from genome-screen data 
for the identification of a diverse subset of mutually unrelated individuals that is representative of 
the ancestries in the entire sample. Principal components that are representative of ancestry are 
obtained by performing PCA directly on genotype data from individuals selected for the ancestry 
representative subset, while coordinates along the axes of variation for the remaining individuals in 
the sample are predicted based on genetic similarities with the diverse subset. The PC-AiR method 
does not require the genealogy of the sampled individuals to be known, and it can be used across a 
variety of study designs, ranging from population based studies where individuals are assumed to 
be unrelated to family based studies with partially or completely unknown pedigrees. 

In simulation studies with a broad range of population structure settings, including ancestry 
admixture, and with sample individuals related according to a variety of genealogical configurations, 
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we demonstrated that the top axes of variation from PC-AiR were nearly perfectly correlated 
with ancestry. In contrast, widely used methods for population structure inference performed 
poorly in the presence of relatedness, including the PCA method implemented in the EIGENSOFT 
software, MDS as implemented in PLINK software, and model-based ancestry estimation methods 
ADMIXTURE and FRAPPE. We also applied PC-AiR and competing methods to the admixed 
HapMap MXL and ASW population samples. Without using any reference population panels 
or pedigree information on the sample individuals, the top two axes of variation from PC-AiR 
nearly perfectly explained proportional European, Native American, and African ancestry in the 
HapMap MXL and ASW samples as compared to a supervised individual ancestry analysis with 
ADMIXTURE that included reference population panels. In contrast, all other population structure 
inference methods were confounded by relatedness, including the FamPCA method which was 
provided the documented pedigree relationships. 

Performing PCA with genome- wide SNP weights that are calculated from external reference 
panels has recently been proposed (Chen et al., 2013) for certain admixed populations. This 
approach requires prior knowledge about the ancestries of the individuals in the sample, which 
may be partially or completely unknown, as well as having available reference panels that are 
adequate surrogates for ancestry. Nevertheless, the PC-AiR method can also easily incorporate 
SNP-weights from external reference panels for population structure inference. For example, by 
designating population samples from external reference panels to be the ancestry representative 
subset in the PC-AiR algorithm, principal components for individuals in the target sample for 
population structure inference will be calculated based solely on SNP weights from the reference 
panels. A potential limitation of using SNP weights from external reference panels, however, is 
that inference on population structure will be limited to the ancestries of individuals selected from 
the panels, which may not be representative of the ancestries of all individuals in the sample. 
An attractive alternative approach would be to perform a PC-AiR analysis on the study sample 
combined with the external reference panels, where genome-screen data would be used by the 
algorithm implemented in PC-AiR for the identification of an ancestry representative subset from 
the combined set of individuals, and where ancestries from both the reference panels and the sample 
will be allowed to contribute to the SNP weights. 

Linear mixed models (LMMs) have recently emerged as a powerful and effective approach for 
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association mapping in samples with population structure as well as family structure or cryptic 
relatedness (Yang et al., 2014). LMMs have previously been evaluated in samples with subtle 
population structure (Price et al., 2010; Wu et al., 2011) and have been shown to have appro- 
priate control over type-I error. We evaluated the performance of LMMs in simulation studies 
where sample individuals have ancestry derived from divergent populations, and our simulation 
results showed that widely used LMM approaches for association mapping, such as EMMAX and 
GEMMA, can have an increase in type-I error due to under-correction of SNPs with moderate to 
high differentiation in allele frequencies between ancestral population, as well as a loss of power due 
to overcorrection of SNPs with little to no differentiation. This result illustrates potential problems 
with existing LMM approaches for association mapping in recently admixed populations, where a 
large proportion of SNPs are expected to have substantial allele frequency differences between the 
underlying ancestral populations. For example, African Americans have genetic contributions from 
European and African ancestral populations, and in a comparative analysis of allele frequencies at 
1.4 million autosomal SNPs for European (CEU) and West African (YRI) samples in HapMap, we 
found that approximately 10% of the SNPs were highly differentiated, with allele frequency differ- 
ences greater than 0.4, while 26% were moderately differentiated, with allele frequency differences 
between 0.2 and 0.4. Our simulation studies also illustrated that including principal components 
from PC-AiR as fixed effects in LMMs resulted in appropriate calibration of association test statis- 
tics at all SNPs in related admixed samples, protecting against inflated type-I error at highly and 
moderately differentiated SNPs. 

The challenges of inferring genetic ancestry in related samples have been well documented 
(Patterson et al., 2006; Price et al., 2010). To our knowledge, PC-AiR is the first method to 
provide robust population structure inference and correction in the presence of known or cryptic 
relatedness without requiring reference population panels, external SNP loadings, or genealogical 
information on the sample individuals. We have implemented the PC-AiR method in an R package 
that is freely downloadable (see Web Resources). 
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Figure Titles and Legends 

Figure 1. Comparison of PC-AiR and EIGENSOFT for Data Simulated under Relationship 
Configuration I and Population Structure I with F$t = 0.01. 

(A and B) Scatter plots of principal components 1 and 2 from PC-AiR (A) and EIGENSOFT (B), 
respectively. 

(C and D) Scatter plots of the simulated population 1 ancestry proportions vs. coordinates along 
principal component 1 for each individual from PC-AiR (C) and EIGENSOFT (D), respectively. 
(A-D) The color of each point represents that individual's true ancestry; red for population 1, blue 
for population 2, and an intermediate color for an admixed individual. 

(A and C) A dot represents an individual in the mutually unrelated ancestry representative set, 
and a plus represents an individual in the related set. 

(B and D) A circle represents an individual not in a pedigree, and a triangle represents an individual 
who is a member of a pedigree. 

(E) Barplot of the efficiency of PC-AiR and EIGENSOFT. Each bar represents the proportion of 
ancestry explained (R 2 value) by each principal component from PC-AiR (gold) and EIGENSOFT 
(black), until a cumulative R 2 of 0.99 is achieved. 
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Figure 2. Comparison of Population Structure Inference Results for Data Simulated under Rela- 
tionship Configuration I and Population Structure II with F$t = 0.1. 

Scatter plots of the simulated population 1 ancestry proportions for each individual are plotted 
against: (A) coordinates along principal component 1 from PC-AiR, (B) coordinates along princi- 
pal component 1 from EIGENSOFT, (C) coordinates along dimension 1 from MDS, and (D) the 
estimated ancestry proportions from ADMIXTURE for the inferred population with the highest 
R 2 . The color of each point represents that individual's true ancestry; red for population 1, blue 
for population 2, and an intermediate color for an admixed individual. 

(A) A dot represents an individual in the mutually unrelated ancestry representative set, and a 
plus represents an individual in the related set. 

(B-D) A circle represents an individual not in a pedigree, and a triangle represents an individual 
who is a member of a pedigree. 

Figure 3. Comparison of Population Structure Inference for the HapMap MXL Sample. 
(F) Individual ancestry estimates for 86 HapMap MXL samples from a supervised individual an- 
cestry analysis with ADMIXTURE. Each individual is represented by a vertical bar; estimated 
European (HapMap CEU), African (HapMap YRI), and Native American (HGDP samples from 
the Americas) ancestry proportions are shown in blue, red, and green, respectively. 
Scatter plots of the European ancestry proportions estimated from a supervised individual ancestry 
analysis with ADMIXTURE for each individual are plotted against: (A) coordinates along princi- 
pal component 1 from PC-AiR, (B) coordinates along principal component 1 from EIGENSOFT, 
(C) coordinates along principal component 1 from FamPCA, (D) coordinates along dimension 1 
from MDS, and (E) the estimated ancestry proportions from an unsupervised analysis with AD- 
MIXTURE for the inferred population with the highest R 2 . The color of each point represents that 
individual's ancestry as estimated from the supervised individual ancestry analysis with ADMIX- 
TURE; blue for European, green for Native American, and an intermediate color for an admixed 
individual. Individuals who are members of MXL Extended Family 1 or 2 are plotted as triangles 
or squares, respectively, and remaining individuals are plotted as circles. 

Figure 4. Comparison of Population Structure Inference for the HapMap MXL and ASW Com- 
bined Sample. 

Scatter plots of the top two axes of variation from PC-AiR (A), EIGENSOFT (B), FamPCA (C), 
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and MDS (D). The color of each point represents that individual's ancestry as estimated from a su- 
pervised individual ancestry analysis with ADMIXTURE; blue for European (HapMap CEU), red 
for African (HapMap YRI), green for Native American (HGDP samples from the Americas), and 
an intermediated color for an admixed individual. Individuals who are members of MXL Extended 
Family 1 or ASW Extended Family 1 are plotted as triangles or stars, respectively, and remaining 
individuals are plotted as circles. 
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Figure 1: Comparison of PC-AiR and EIGENSOFT for Data Simulated under Relationship 
Configuration I and Population Structure I with F$t = 0.01 
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Figure 2: Comparison of Population Structure Inference Results for Data Simulated under Rela- 
tionship Configuration I and Population Structure II with F$t = 0.1 
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Figure 3: Comparison of Population Structure Inference for the HapMap MXL Sample 



European Ancestry vs. PC-AiR PC1 



pap 



<P8 



T 



European 
Native American 



-0.3 



-0.2 -0.1 0.0 0.1 



I r~ 

0.2 0.3 



~ r 

0.4 



PC-AiR PC1 
European Ancestry vs. MDS Dimension 1 




B 



European Ancestry vs. EIGENSOFT PC1 




* MXL Extended Family 1 
■ MXL Extended Family 2 
o Other 



"~ I — 

-0.1 



— I — 

0.0 



0.1 



0.2 



—I 

0.3 



""I 

0.4 



EIGENSOFT PC1 
European Ancestry vs. ADMIXTURE Ancestry 




European Ancestry vs. FamPCA PC1 




FamPCA PC1 
Supervised Estimated Ancestry 



European 
Atrican 

Native American 




MDS Dimension 1 



ADMIXTURE Estimated Ancestry Proportion 



Individual 



Downloaded from http://biorxiv.org/on September 18, 2014 



35 



Figure 4: Comparison of Population Structure Inference for the HapMap MXL and ASW Com- 
bined Sample 
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Tables 



Table 1: Proportion of Ancestry Explained (R 2 ) by PC-AiR and EIGEN- 
SOFT in Simulation Studies 
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a d denotes the number of axes of variation included as predictors in the linear regression 
model to determine the R 2 value for either PC-AiR or EIGENSOFT. 

b d* is the number of axes of variation from EIGENSOFT that are required to either match 
the R 2 value of the first axis of variation from PC-AiR or achieve an R 2 of 0.99, whichever 
is smaller. 
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Table 2: Genomic Control Xqc f° r Association Testing Simulation Study 



Method 
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Differentiated 
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EMMAX + 10 PCs from EIGENSOFT 


1.07 
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EMMAX + 1 PC from PC-AiR 


1.00 


1.00 


1.00 



a Highly differentiated SNPs have allele frequency differences ^ 0.4 between the two ancestral populations. 
6 Moderately differentiated SNPs have allele frequency differences < 0.4 and S= 0.2 between the two ancestral 
populations. 

c Lowly differentiated SNPs have allele frequency differences < 0.2 between the two ancestral populations. 



Table 3: Population Structure Inference Results for HapMap MXL and ASW 
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Population structure inference results from each method were compared to ancestry estimates from a supervised 
individual ancestry analysis with ADMIXTURE including reference population panels. 

a d denotes the number of axes of variation included as predictors in the linear regression model to determine the 
R 2 value for each of the methods. 

b PCA was performed with the EIGENSOFT software. 
c MDS was implemented in the PLINK software. 

d FamPCA is the Zhu et al. (2008) method as implemented in the KING (Manichaikul et al., 2010) software. 
e An unsupervised ADMIXTURE analysis was conducted without including reference population panels. FRAPPE 
results were identical to ADMIXTURE. 
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Appendix 

Under general population genetic modeling assumptions, we show that for an out bred unrelated 
pair of individuals i and j, the Kij estimator in Equation (1) is a consistent estimator for the 
quantity 

~U a i -a J ) T S i ^(a i -a,-) 



1 - \ \af E^a, + aJS^aj 
An equivalent expression of the estimator Kij from Equation (1) is 



1 / \k\^s l3 {gl-2g ls9js + g 



1%, 



^-~ 2 H-l- (A2) 



Under our population genetic modeling assumptions, the vector of subpopulation-specific allele 
frequencies, p s , has the properties E[p s ] = p s l and Cov[p s ] = p s (l — p s )T,K for all s e S. We 
assume that the ancestral allele frequencies, p s for s e S, are independent and identically distributed 
(i.i.d.) random variables from some unspecified distribution on [0, 1]. Under this assumption, the 
unconditional expectation of each of the random variables in Equation (A2) is the same for every 
choice of s £ S, and if we assume that genotypes at different SNPs are independent, then 

- 1 / E[gl]-2E[ gtsgjs ]+E[g] s ] \ 

E[l [flfa=1]]+ E[l fes=1] ] ) (A3) 

as \S\ —* oo. Note that the independence of SNPs assumption can be relaxed for Equation (A3), 
and a sufficient condition would be that the effective number of independent SNPs tends to oo. In 
what follows, we derive each of the expectations in Equation (A3) conditional on p s , and we show 
that the limiting value of Kij is the value given in Equation (Al), which does not depend on p s , 
implying that the i.i.d. assumption can also be relaxed. 

Recall that gi s is the number of copies of the reference allele that individual i has at SNP s, 
and thus gi s can have a value of 0, 1, or 2. As in Thornton et al. (2012), we define the quantity 
to be one half of the expectation of gi s , conditional on individual i's ancestry, aj, and the vector of 
subpopulation-specific allele frequencies, p s , at SNP s: 

K 



H is = ^E[g is \ai, p s ] = ajp s = ^ a\p k s . (A4) 
The quantity fJLi s can be interpreted as the individual-specific allele frequency for individual i at 
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SNP s, and it is a linear combination of the subpopulation-specific allele frequencies weighted 
by individual i's autosomal ancestry proportions from each of the ancestral subpopulations. In 
Thornton et al. (2012), both aj and p s are treated as fixed quantities. Here, we similarly treat the 
ancestry vectors as fixed, and we implicitly condition on a^ and a^- throughout what follows, but we 
allow p s to be a random vector for all s e S. Under our weak basic genetic modeling assumptions, 
we calculate the following: 



ED*,] = E[af pj = afE[p s ] = (af l)p s = p s 



(A5) 



E\jJbiafJlj s ] = E 
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^ ^ af4'E[p^f] 
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k=l k'=l 

(Psf t a * t < + t t («?<Cov[p^'] 

fc=l k'=l fc=lfc'=l 

(Ps) 2 +PsO- ~ Ps)af E^a,- 



(A6) 



The expectation E[g?,] can be obtained directly based on the genotype probabilities for indi- 
vidual i conditional on p s , where the conditional probabilities that individual i has genotype 0, 1, 
and 2 at SNP s are (1 — Hi S ) 2 , 2/^(1 — Hi s ), and fi 2 s , respectively. Thus we obtain 

E[^]=E[E[ 5 ?|p s ]] 

= E[2 Ws (l - Mis ) + 4/xfJ 
= 2E\p is ] + 2E[/&] 

= 2 Ps + 2(p s ) 2 + 2 Ps (l - p s )af E^a;. (A7) 



To obtain E[l[ g . s=1 ]], we note that the expectation of an indicator function is just the probability 
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of the event it indicates, so 



E[l[ ffis = l]]= E [ E [ 1 [ 3is = l]lP S ]] 

= E[F[g is = l|pj] 
= E[2 W ,(1- W> )] 
= 2p s (l-p s )[l-afS^ ai ]. 



(A8) 



The final expectation that we must obtain for Equation (A3) is E[gi s gj s ]. Since we are only 
considering unrelated individuals here, the genotype values gi s and gj s are independent conditional 
on the vector of subpopulation allele frequencies, so E[<?j s <?j s |p s ] = E[<?j s |p s ]E[<jfy S |p s ]. Therefore, 
we can calculate 

E[g is g js \ = E[E[g is g js \p s ]] 

= E[E[g is \p s ]E[g js \p s ]] 

= 4{ Ps ) 2 + 4p.(l - Ps )aJ E^a,- (A9) 

Finally, plugging the expectations given in Equations (A7), (A8), and (A9) into Equation (A3), we 
have 



1 
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1 f [2 - af S x a t - ajE^] - [2 + a[ S^a, - 4a[ E^- + aJSjgaj] ' 

2 y 2 - af S A 'aj - ajE^aj 
4af E^a, - 2af E^a, - 2aJSjfa,- 

4 - 2af E^a; - 2ajs^ 
-|(aj - a j ) T 5]K(aj - a.,-) 



(A10) 



1 - i[afS^ai + aJS ir a J ]' 

the limiting value of Kjj that is given in Equation (Al). Note that the limiting value is not a 
function of p s , and thus this convergence holds for any random p s . 



